Design of an RSS Crawler with Adaptive Revisit Manager

Authors

  • Bum-Suk Lee
  • Jin Woo Im
  • Byung-Yeon Hwang
  • Du Zhang
Abstract

RSS is widely used for notifying readers of updated information on blogs and for feeding news to readers quickly. RSS is very simple, and so it is mostly used as a web service. However, there is no satisfactory search engine that works for RSS. The reason is that RSS is continuously modified, and the structure of general search engines is ineffective for collecting information from RSS sources. In this paper, we discuss a web crawling algorithm and propose a structure for an RSS crawler geared toward collecting and updating RSS in the Web 2.0 environment. The proposed method (1) uses the history of visited domain names to predict the location of the RSS of a new seed URL, and (2) updates RSS information adaptively, based on some update-checking heuristics. These approaches can serve as cornerstones for an efficient and effective RSS search engine.
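
A minimal Python sketch of these two ideas follows; it is not the authors' algorithm. The path table RSS_PATH_HISTORY, the interval bounds, and the halving/doubling rule are illustrative assumptions, meant only to show how domain-name history could rank candidate RSS locations and how a revisit interval could adapt to observed updates.

from urllib.parse import urlparse, urlunparse

# (1) Predict the RSS location of a new seed URL from the feed paths that
#     worked on previously visited domains (hit counts here are made up).
RSS_PATH_HISTORY = {"/rss": 12, "/feed": 9, "/index.xml": 4}

def candidate_rss_urls(seed_url):
    parts = urlparse(seed_url)
    ranked = sorted(RSS_PATH_HISTORY, key=RSS_PATH_HISTORY.get, reverse=True)
    return [urlunparse((parts.scheme, parts.netloc, path, "", "", ""))
            for path in ranked]

# (2) Adapt the revisit interval per feed: check sooner after a detected
#     update, back off when the feed was unchanged (bounds are assumed).
MIN_INTERVAL, MAX_INTERVAL = 600, 86400   # seconds

def next_interval(current, updated):
    if updated:
        return max(MIN_INTERVAL, current // 2)
    return min(MAX_INTERVAL, current * 2)

print(candidate_rss_urls("http://example.com/blog"))
# ['http://example.com/rss', 'http://example.com/feed', 'http://example.com/index.xml']
print(next_interval(3600, updated=True))   # 1800

In a real crawler the path statistics would be persisted per domain, and each candidate URL would be verified with an HTTP request before being accepted as a feed.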


Similar Articles

Personalized RSS Search Service Using RSS Characteristics and User Context

RSS is one of the most important technologies in Web 2.0. Although a lot of RSS feeds are available, finding the information that is relevant to a user is not easy. Previous RSS search services have not taken into account RSS feed characteristics and the user's contextual information, which seriously limits their ability to offer users useful information. This paper proposes a new personalized RSS search ...


Automated System for Improving RSS Feeds Data Quality

Nowadays, the majority of RSS feeds provide incomplete information about their news items. This lack of information leads to a loss of user engagement. We present a new automated system for improving the data quality of RSS feeds. RSS feeds provide a list of the latest news items ordered by date, which makes it easy for a web crawler to precisely locate each item and extract its raw content ...
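
As a hedged illustration of that property, the following sketch parses a standard RSS 2.0 feed with Python's standard library and collects the title, link, and publication date of each item; fetch_items and the example feed URL are illustrative names, not part of the system described above.

import urllib.request
import xml.etree.ElementTree as ET

def fetch_items(feed_url):
    """Return (title, link, pubDate) for each <item> in an RSS 2.0 feed."""
    with urllib.request.urlopen(feed_url) as resp:
        root = ET.fromstring(resp.read())
    items = []
    for item in root.iter("item"):          # RSS 2.0 lists items newest-first
        items.append((item.findtext("title", default=""),
                      item.findtext("link", default=""),
                      item.findtext("pubDate", default="")))
    return items

# Usage (hypothetical feed URL):
# for title, link, published in fetch_items("http://example.com/rss"):
#     print(published, title, link)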


Cobra: Content-based Filtering and Aggregation of Blogs and RSS Feeds

Blogs and RSS feeds are becoming increasingly popular. The blogging site LiveJournal has over 11 million user accounts, and according to one report, over 1.6 million postings are made to blogs every day. The "Blogosphere" is a new hotbed of Internet-based media that represents a shift from mostly static content to dynamic, continuously updated discussions. The problem is that finding and tracking ...


RSS-Crawler Enhancement for Blogosphere-Mapping

The massive adoption of social media has provided new ways for individuals to express their opinions online. The blogosphere, an inherent part of this trend, contains a vast array of information about a variety of topics. It is a huge think tank that creates an enormous and ever-changing archive of open source intelligence. Mining and modeling this vast pool of data to extract, exploit and describe ...


Automatic Extraction of Complex Web Data

A new wrapper induction algorithm, WTM, is presented for generating rules that describe the general layout template of a web page. WTM is mainly designed for use in a weblog crawling and indexing system. Most weblogs are maintained by content management systems and have similar layout structures across all pages. In addition, they provide RSS feeds that describe the latest entries. These entries appear in the ...
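
The sketch below is not the WTM algorithm itself; it only illustrates, under the assumption that an entry's RSS title also appears verbatim in the linked page, how the tag path of the element containing that title could serve as a first hint of the shared layout template.

from html.parser import HTMLParser

class PathFinder(HTMLParser):
    """Record the tag path of the first element whose text contains the target."""
    def __init__(self, target_text):
        super().__init__()
        self.target = target_text.strip()
        self.stack = []          # currently open tags
        self.found_path = None   # tag path of the element holding the target

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.found_path is None and self.target and self.target in data:
            self.found_path = "/".join(self.stack)

# Usage: the title comes from the feed's <item><title>, the HTML from its <link>.
html_page = "<html><body><div class='post'><h2>Hello RSS</h2></div></body></html>"
finder = PathFinder("Hello RSS")
finder.feed(html_page)
print(finder.found_path)   # html/body/div/h2

Repeating this over several entries of the same weblog would reveal which tag paths recur, which is the kind of regularity a layout-template rule can be built on.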



Journal:

Volume   Issue

Pages  -

Publication date: 2008